Generating Eeveelution Point Clouds With Point Set Generation Network

Starter Project by Vadim Kudlay

This project is made in response to the following starter project option with some slight modifications:

Generate your own dataset of 3D shapes and train a neural network to generate them. Skills demonstrated: 3D deep learning, data wrangling, 3D geometry. This is akin to the first project, but a bit more involved: instead of classifying two types of procedural shapes, you should write a procedure that generates just one type of procedural shape and then train a deep generative model to mimic the behavior of this procedural model. There are many possible 3D generative architectures you could go with: 3D CNNs, point set generating networks, or implicit field generators are good places to start.

I thought the previous project on making a PokéGAN was pretty fun, so I wanted to use a set of related Pokémon for this project. For those unfamiliar with Pokémon, Eevee is a Normal-type Pokémon species introduced in the first generation of the franchise. It is famous for having many evolutions, or "Eeveelutions," with different elemental types (e.g. Flareon is a Fire type evolved using a fire stone, Vaporeon is Water, etc.). They are all quite similar in design, body shape and otherwise, and can reasonably be treated as a single logical class.

I have retrieved copies of the models used in Pokémon X/Y (Nintendo, 2013) and would like to create a network that can be trained to generate them. To do this, I use one of the approaches discussed in Learning Representations and Generative Models for 3D Point Clouds (Achlioptas et al. 2018) for learning and generating point cloud representations of 3D models. This is mostly because, honestly, it looked relatively simple, and also because I couldn't get a good enough grasp of dealing with voxelized components in pyMesh to replicate the IM-NET implicit field decoder experiments.

I was slightly limited in the libraries I have access to since I am using an M1 Mac, but I was able to do just fine with TensorFlow 2.5 and pyMesh.

Eevee_pics.png

Premise

For this project, we would like to create a generator that generates Eeveelutions. As such, we need to get some models for those. The models below have been sourced from Models-Resource.com for non-commercial use and have been converted to non-textured OBJ files.
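Since the converted files are untextured OBJs, only the vertex records matter for building point clouds. As an illustrative sketch (not the project's actual loader, and the file names under /data/objs are whatever the converted models were saved as), a minimal parser needs little more than:

```python
import numpy as np

def load_obj_vertices(lines):
    """Parse vertex positions ('v x y z') from the lines of a Wavefront OBJ.

    Faces, normals, and texture coordinates are ignored, since for point
    cloud work we only need the vertex set. `lines` is any iterable of strings.
    """
    verts = []
    for line in lines:
        parts = line.split()
        if parts and parts[0] == "v":
            verts.append([float(p) for p in parts[1:4]])
    return np.asarray(verts, dtype=np.float32)

# Tiny inline example: a unit triangle.
obj_text = """\
# comment
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
f 1 2 3
"""
verts = load_obj_vertices(obj_text.splitlines())
print(verts.shape)  # (3, 3)
```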

We have defined most of our utilities in auxiliary files so that we don't clutter up the notebook. Below is a quick demonstration of how to visualize some of the components that we will be interested in.

For our consideration, we are going to use the following models located in /data/objs. Of note, I did not use Sylveon because 9 is a weird number unless we go with a 3x3 grid and also because I do not expect the t-posing ribbon-arms to be interesting to look at.

To get a more even sample of the points, we will voxelize the meshes at a conservative precision and sample from the voxel vertices. We'll go ahead and create some convenient sample stashes for quick reference. Of course we'll have to generate more samples later, but we will not need to re-pull or re-voxelize the meshes throughout the process (unless we want to re-voxelize at different precisions to get even more point variation, which I don't think will be necessary).
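The voxelization itself happens in the auxiliary files, but the sampling step can be sketched as follows, assuming the stash is simply an (N, 3) array of voxel vertex positions:

```python
import numpy as np

def sample_point_cloud(stash, n_points=1024, rng=None):
    """Draw a fixed-size point set from a stash of voxel vertices.

    `stash` is an (N, 3) array of vertex positions pulled from the
    voxelized mesh. Sampling with replacement keeps this valid even
    when the stash holds fewer than `n_points` vertices.
    """
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(stash), size=n_points, replace=True)
    return stash[idx]

stash = np.random.rand(5000, 3).astype(np.float32)  # stand-in for a real stash
cloud = sample_point_cloud(stash, n_points=1024, rng=0)
print(cloud.shape)  # (1024, 3)
```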

To augment the dataset, we will define an augmentation function capable of rotating and scaling the vertices. It will also always map the result into the [0,1] unit space. For convenience, we will not define an inverse transformation, since it is easy enough to reason directly in the compacted scale.
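The actual helper lives in the auxiliary files; a minimal sketch of the idea, assuming rotation about the vertical axis and a uniform rescale into the unit cube (so aspect ratio is preserved), looks like:

```python
import numpy as np

def augment(points, angle=0.0, scale=1.0):
    """Optionally rotate (about the y/vertical axis) and scale a point
    cloud, then map it into the [0, 1] unit cube with a uniform rescale."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])          # rotation about the vertical axis
    pts = (points @ rot.T) * scale
    lo = pts.min(axis=0)
    span = (pts.max(axis=0) - lo).max()     # uniform scale preserves proportions
    return (pts - lo) / max(span, 1e-9)

pts = np.random.rand(1024, 3) * 10 - 5
out = augment(pts, angle=np.pi / 4, scale=2.0)
print(out.min(), out.max())  # both within [0, 1]
```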

Of note, we will not be using the scaling and rotating functionality at all (which is unfortunate since I already put it in) and will just be using it to scale down to unit space. Forming good transformation-invariant features seems to be out of reach given my compute limitations.

We will now define a testing and validation dataset... which we will never use for its intended purpose. Instead, we will use a generating function which will generate new augmented samples automatically. We'll keep DS in there just because it's easy - if unnecessary - to pull from the "validation set" for some preprocessed inputs for our visualizations.

We will also create a convenient visualization function to see the execution of the model while it is running. Don't worry if this looks out of place; it really just goes along with the respective model code in terms of structure, but I wanted it to be handled by the logging class and also wanted easy plotting re-specification in the notebook.

Running the Autoencoder

We have created an autoencoder specification in cae.py. We have also specified some loss functions to use, and log results in an auto-generated directory corresponding to the StatusTracker name. A rudimentary checkpoint system is in place so that, hopefully, model training can be resumed if something unfortunate happens. It's not bullet-proof, but I guess it's better than nothing...

The model specifications found in AE_DefaultModel reflect a best-effort attempt to replicate the structure in the paper. As you may have seen earlier, we will only be using 1024 point samples for our model instead of the recommended 2048. This is for the sake of compute time, since I would not like for these models to run for too long. The architecture has been compensated to reflect this (see AE_DefaultModel implementation).

The loss function used is based on Chamfer distance as specified in the paper (sources cited in code): $$\mathcal{L}_{CH} = \sum_{x \in X}\min_{y \in h(X)}||x-y||^2_2 + \sum_{y \in h(X)}\min_{x \in X}||x-y||^2_2$$
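A NumPy sketch of this loss (the training code itself uses TensorFlow; here $X$ is the input set and $h(X)$ the reconstruction) can be written as:

```python
import numpy as np

def chamfer_distance(x, y):
    """Symmetric Chamfer (pseudo-)distance between two point sets.

    x: (N, 3) ground-truth points; y: (M, 3) reconstructed points.
    Uses squared Euclidean distances, matching the formula above.
    """
    # (N, M) matrix of pairwise squared distances via broadcasting.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer_distance(x, x))        # 0.0 for identical sets
print(chamfer_distance(x, x + 0.1))  # small positive value
```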

Below is a training run of the model. I have already trained it up to epoch 400. The reason for stopping at 400 is to keep the time commitment reasonable, and also because (from experimentation) the marginal benefit diminishes sharply after a certain point. I can believe the paper when it recommends 1000+ epochs, but I've had to rerun this experiment quite a few times to play around with the settings, so 400 is a more manageable number. Also, the batch size and the number of point clouds generated per category have both been reduced to 20. This limits resource use, and the benefit of larger values seemed marginal in past runs.

Training to 400 epochs using my code generates a lot of output by default, so I will leave the code in and only show the last epoch of training. More visualizations can be found in the associated log folder.

From this model, a specific preprocessed example can be encoded and decoded as follows:

Though possibly redundant given the visualizations coming out of the training loop, below are all of the classes and the model's attempts at reconstructing them through the bottleneck. Included are the original vertices from the voxelization, a cluster of five 1024-point input examples, their corresponding predictions, and the predictions of just the first sample (for comparison).

As we can see from visual inspection, the generated clouds don't actually vary much. The plots use some pretty thick points, but we can see from landmark points that the point distributions in the combined- and single-prediction plots are pretty much identical. This is to be expected, as the network was supposed to extract high-level attributes from the point cloud samples, transfer them through the bottleneck, and reconstruct an output based on them.

Training the $\ell$-GAN

The paper discussed several techniques for generating convincing images from random noise. One such technique was to train a GAN (generative adversarial network) on the latent bottleneck features of the autoencoder. In doing so, it can learn to generate good latent features within a tighter specification (relative to just throwing in random latent features and hoping they work well for the decoder). I'm not really sure why another autoencoder couldn't have been used, but I might as well replicate it for practice.

Below, we define the visualization function and a modified generator for the training examples (which just piggybacks from the previous generator):

Running The Model

The paper specifically recommended a WGAN implementation with a very shallow architecture. Note that for the purposes of generality, I do occasionally label the generator-opposing network as 'discriminator' instead of 'critic'; I do acknowledge that critic is the more accurate term given its purpose. I used a WGAN-GP gradient penalty to enforce a Lipschitz constraint on the critic as recommended, the code for which I repurposed from the Coursera course Generative Deep Learning with TensorFlow (DeepLearning.AI) - specifically from this notebook - because it allows easy switching to the DRAGAN gradient penalty in case we need it.
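The penalty term itself is $\lambda\,(\|\nabla_{\hat{x}} c(\hat{x})\|_2 - 1)^2$ evaluated at points $\hat{x}$ interpolated between real and fake samples. As an illustrative sketch without autodiff, the computation can be shown with a linear critic $c(x) = w \cdot x$, whose input gradient is $w$ everywhere; the real code computes the gradient with TensorFlow's `GradientTape` instead:

```python
import numpy as np

def gradient_penalty_linear(w, real, fake, lam=10.0, rng=None):
    """WGAN-GP penalty, illustrated with a linear critic c(x) = w @ x.

    For this toy critic, d c(x)/dx = w for every input, so the penalty
    reduces to lam * (||w|| - 1)^2. Real code would use tf.GradientTape
    to get the critic gradients at the interpolated points.
    """
    rng = np.random.default_rng(rng)
    eps = rng.uniform(size=(len(real), 1))
    x_hat = eps * real + (1.0 - eps) * fake   # interpolated samples
    grads = np.tile(w, (len(x_hat), 1))       # critic gradient at each x_hat
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

w = np.array([3.0, 4.0])                      # ||w|| = 5
real = np.random.rand(8, 2)
fake = np.random.rand(8, 2)
print(gradient_penalty_linear(w, real, fake))  # 10 * (5 - 1)^2 = 160.0
```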

I attempted to follow the instructions per the paper and implemented it using a small number of dense layers instead of convolutional layers (I'm not sure if that's the best idea, but I'm just going with it). Some slight deviations were made for the sake of runtime. Specifically, I implemented the generator $g$ and critic $c$ as follows:

The two models will be trained in lockstep and once per epoch, though this can be specified using the train_counts parameter.

Again, we will pre-run the training to keep the notebook less verbose. More data can be found in the respective log file. As the GAN takes in 2 features as input, the output shows what happens as the two features vary in increments of 0.25.

As we can see, the GAN learned to generate a selection of valid latent inputs using only 2 input features to represent 4 image categories. I guess that means it works? I'm sure it could be trained further to generate over a wider output range (and adding another dimension to the GAN input space would probably help), but I don't think that's necessary for this exercise.

Using a Variational Autoencoder

The paper suggested that a VAE would not work as well in the general case due to over-regularization. Still, I wanted to go ahead and try it out. I tried to keep the overall structure pretty simple to keep it in line with the AE for comparison (and to save running time for replication), so I kept the number of trainable parameters below 10 million. I tried using convolutional layers on both sides.

As before, here is the visualization function:

Trying The Model

Below is the instantiation of the model. We will use just a basic variational autoencoder with normally-distributed sampling. We will not do anything special like disentangling or including any special loss terms.

Specifically the model consists of:

As before, the Chamfer distance loss function will be used for the reconstruction loss and 400 epochs will be used to train the network. For the latent regularization loss we will use the Kullback–Leibler (KL) divergence - specifically, a variation (equation 10) as described in the DeepLearning.AI course - on the means $\mu$ and deviations $\sigma$: $$\mathcal{L}_{KLD} = -\frac{1}{2}\sum_{i}(1 + \ln(\sigma_i^2) - \mu_i^2 - \sigma_i^2)$$

Note that since the code stores the log-variance, the summand actually used internally is $1 + \sigma_i - \mu_i^2 - e^{\sigma_i}$, where $\sigma_i$ here holds $\ln(\sigma_i^2)$.
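As an illustrative NumPy sketch of this term (using the log-variance parameterization and the standard sign convention for a loss being minimized; the actual implementation lives in the VAE model code):

```python
import numpy as np

def kld_term(mu, log_var):
    """KL regularization term for a diagonal-Gaussian VAE posterior.

    `log_var` holds ln(sigma^2), so the summand is
    1 + log_var - mu^2 - exp(log_var), matching the note above.
    """
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# A standard-normal posterior (mu = 0, sigma = 1) incurs zero penalty;
# anything else is penalized.
print(kld_term(np.zeros(8), np.zeros(8)))  # 0.0
print(kld_term(np.ones(8), np.zeros(8)))   # 4.0
```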

I just wanted to mention that I did try it. In the end, I'm not including the results, since the model converged much more slowly and didn't seem to approach the same level of reconstruction quality within a reasonable epoch count (at least on my setup). It did train quickly and produce really nice reconstructions when using only the reconstruction loss, and the resulting mean/variance encodings could be interpolated between for nice class transitions on decoding. Still, dropping the KL term would largely negate the main allure of a VAE: being able to feed in a random normal latent vector of the right shape and have it just work.

Interpolating between class samples

I thought that it would be interesting to see how the network would perform when asked to transition from one class to another. In the case of the AE model, this amounts to translating between latent representations using a strategy such as Euclidean straight-line interpolation.
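Straight-line interpolation in the latent space can be sketched as follows; each intermediate vector would then be passed through the decoder to render the transition (the function name here is just for illustration):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=5):
    """Euclidean straight-line interpolation between two latent codes.

    Returns `steps` vectors moving linearly from z_a to z_b.
    """
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - ts) * z_a + ts * z_b

z_a = np.zeros(4)   # stand-ins for encoded bottleneck vectors
z_b = np.ones(4)
path = interpolate_latents(z_a, z_b, steps=5)
print(path)  # rows step evenly from all-zeros to all-ones
```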

Autoencoder

Training an AE with Targeted Generation

I thought it would be interesting to see what would happen if instead of using an auto-encoder to reconstruct the point-clouds, I try using an auto-encoder to convert a point-cloud to a new class. Of course, doing it naively would be pretty bad, but I wanted to see what would happen.

For a first run, I tried just to train an auto-encoder that took in an Eevee and a one-hot encoding of the target class. The origin class point-cloud was passed in through the encoder as usual, but the destination encoding was passed directly to the decoder by appending it to the latent bottleneck vector; this was a design choice that I thought might make the approach more flexible later. I used a similar architecture as before and just ran it to see what happened. I intentionally left the batch size and sample size very small because the results of this really won't matter.

We're gonna go ahead and cut the run short. This run is pretty useless, but there was a benefit to running it. As you can see, the decoder could reconstruct the output from the one-hot encoding alone, as can be seen with this example where the point cloud input is switched to that of the target class:

Still, this shows us that the decoder is sufficiently deep to be able to store such relationships, even if only memorizing them.

Specifying A More Interesting Model

From here, let us try to actually define an evolutionary behavior. We'll say that to evolve from an origin class to a destination class, the transformation feature vector will show a decrease in the origin class feature and an increase in the destination class feature.

We will define the new generator as follows, which will now include a transformation vector. This transformation vector will have a 0 for the origin class, a 1 for the destination class, and 0.5 for all other classes.
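That vector-building step can be sketched as follows (the class names here are placeholders; the actual generator uses whatever class list the dataset code defines):

```python
import numpy as np

CLASSES = ["Eevee", "Flareon", "Jolteon", "Vaporeon"]  # placeholder list

def transform_vector(origin, destination, classes=CLASSES):
    """Build the transformation vector described above:
    0 for the origin class, 1 for the destination class, 0.5 elsewhere."""
    vec = np.full(len(classes), 0.5)
    vec[classes.index(origin)] = 0.0
    vec[classes.index(destination)] = 1.0
    return vec

print(transform_vector("Eevee", "Flareon"))  # origin -> 0, destination -> 1, others -> 0.5
```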

I noticed a pretty concerning trend early on where the network learns to rely solely on the transformation encoding to determine the output. This can be seen by the fact that the identity transformations all look the same while the non-identity transformations keep improving. Instead of waiting to see if it became a problem, I decided to address this concern early on by using two techniques:

We'll have to redefine everything in TAE2 which we will use below:

Playing around with the model

We're gonna take a look at the behavior to figure out what kinds of things this model managed to learn. First, we're just going to define a simple function to get the desired transformation feature vector:

Now, let's just see what happens when we progressively transition an input point-cloud to some other target classes...

As we can see, the network managed to both reconstruct the output images as expected while also having enough flexibility to get some nice shape mixtures in the process.

In contrast, we can also take a look at coming from a very busy and noisy point cloud profile such as that of a Jolteon and see that the results are pretty similar!

Testing some edge cases:

Lastly, I wanted to see what would happen if we changed the input point cloud class while keeping the transformation vector the same. As we can see, the end result actually did deviate based on the sample input; the tail varied slightly based on the origin point cloud, and the Jolteon origin point cloud resulted in a very Jolteon-like result.

Challenges During Implementation

Throughout this experiment, multiple difficulties stifled progress. Since this write-up is meant to discuss how the project went along, I thought it would be useful to cover them here.

Main References